Appendix F — Assignment F

Model selection

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

  2. Insert your code in R code chunks, and type your answers outside the code chunks. Ensure that the solutions are written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan it, and submit it separately.

  4. The assignment is worth 100 points, and is due on 21st November 2023 at 11:59 pm.

  5. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There are no excessively long outputs of extraneous information (e.g., no printouts of entire data frames without good reason, no long printouts of which iteration a loop is on, no long sections of commented-out code, etc.), and no unnecessary or redundant code or text (1 pt).
  • Final answers to each question are written clearly (1 pt).
  • The proofs are legible and clearly written, with reasoning provided for every step, so that they are easy to follow and understand (1 pt).

F.1 Energy model

The datasets ENB2012_Train.csv and ENB2012_Test.csv provide data on energy analysis using different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. Below is the description of the data columns:

  1. X1: Relative Compactness
  2. X2: Surface Area
  3. X3: Wall Area
  4. X4: Roof Area
  5. X5: Overall Height
  6. X6: Orientation
  7. X7: Glazing Area
  8. X8: Glazing Area Distribution
  9. Y1: Heating Load

F.1.1 Number of models

Suppose that we want to implement the best subset selection algorithm to find the first-order predictors (X1-X8) that can predict heating load (Y1). How many models for \(E(Y1)\) are possible, if the model includes (i) one variable, (ii) three variables, and (iii) eight variables? Show your steps without running any code.

Note: The notation \(E(Y1)\) means the expected value of Y1, i.e., the mean of Y1.
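Hint:

As a refresher on the relevant counting (you still need to show your steps), the number of distinct models containing exactly \(k\) of \(p\) candidate predictors is \(\binom{p}{k} = \frac{p!}{k!\,(p-k)!}\).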

(3 points)

F.1.2 Resolving perfect multicollinearity

Fit a linear regression model and attempt to find the VIF of each predictor. It will throw an error because of perfect multicollinearity among the predictors. Identify the perfectly collinear relationship, and discard a predictor to eliminate the perfect multicollinearity. Among the collinear predictors, discard the one that is most highly correlated with a predictor outside the set of collinear predictors. State the discarded predictor, and keep it discarded for all the remaining questions in this assignment.

Hint:

You may use the R function alias() to identify the perfectly collinear relationship. See the documentation of alias() for an example.
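A minimal sketch of this step, assuming the training data has been read into a data frame named ENB_train and the car package is available:

# Read the training data and fit the full first-order model
ENB_train <- read.csv("ENB2012_Train.csv")
model_full <- lm(Y1 ~ ., data = ENB_train)

# vif() fails here: "there are aliased coefficients in the model"
# library(car); vif(model_full)

# alias() reveals the exact linear dependency among the predictors
alias(model_full)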

(2+2 = 4 points)

F.1.3 Best model(s)

Implement the best subset selection algorithm to find the best first-order predictors of heating load Y1, with respect to \(R^2\), Adj-\(R^2\), \(C_p\), and the \(BIC\) criteria.

Note:

  1. Use ENB2012_Train.csv and consider only the first-order terms.
  2. Make plots visualizing the \(R^2\), Adj-\(R^2\), \(C_p\), and the \(BIC\) criteria for the best model corresponding to each possible number of predictors.

Print the coefficients of all the models that are the best based on at least one criterion among Adj-\(R^2\), \(C_p\), and \(BIC\).
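A sketch using the leaps package, assuming the predictor identified in F.1.2 has already been dropped from ENB_train:

library(leaps)

# Exhaustive (best subset) search over the remaining first-order terms
subsets <- regsubsets(Y1 ~ ., data = ENB_train, nvmax = 7)
summ <- summary(subsets)

# summ$rsq, summ$adjr2, summ$cp, and summ$bic hold the criterion values for
# the best model of each size, and can be plotted against the number of predictors
best_adjr2 <- which.max(summ$adjr2)
best_cp    <- which.min(summ$cp)
best_bic   <- which.min(summ$bic)

# Coefficients of, e.g., the model that is best by Adj-R^2
coef(subsets, best_adjr2)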

(2 + 4 + 2 = 8 points)

F.1.4 Best among best?

We need to choose one among all the best models identified in the previous question. Use the following approach:

  1. Compare the coefficients of each model when fitted on the training data with the coefficients when fitted on the test data. If the coefficients are similar, the model is likely to generalize to unseen data. Otherwise, the model has issues, and you may discard it.

  2. Check the prediction error (root mean squared prediction error) on the test data. If the prediction error of one model is much lower than the rest, then you may select that model.

  3. Check the model assumptions regarding homoscedasticity and normal distribution of the error terms using statistical tests. You may select a model if it departs much less from the assumptions than the rest.

  4. If there is no clear choice based on (1), (2), and (3), perform the general linear F test (with a significance level of 5%) to choose the most parsimonious model; a sketch of steps (2)-(4) follows below.

Print the coefficients of the chosen model.
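A sketch of steps (2)-(4), assuming ENB_test holds the test data and model_a and model_b are two hypothetical nested candidates from F.1.3:

# Step (2): root mean squared prediction error on the test data
rmse <- function(model, newdata) {
  sqrt(mean((newdata$Y1 - predict(model, newdata))^2))
}
rmse(model_a, ENB_test)

# Step (3): Breusch-Pagan test (homoscedasticity) and
# Shapiro-Wilk test (normality of residuals)
library(lmtest)
bptest(model_a)
shapiro.test(residuals(model_a))

# Step (4): general linear F test between nested models (reduced model first)
anova(model_a, model_b)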

(2x4 = 8 points)

F.1.5 Overfitting

Compute the estimate of the error standard deviation. Compare it with the RMSE on the test data, and comment on whether the model chosen in the previous question (F.1.4) seems to overfit the data.
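A sketch, assuming model_chosen is the model selected in F.1.4 and rmse() is the helper sketched above:

# Residual standard error: the estimate of the error standard deviation
sigma(model_chosen)

# Compare with the prediction error on unseen data
rmse(model_chosen, ENB_test)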

(2 + 2 = 4 points)

F.1.6 Best model vs all-predictor model

Calculate the RMSE of the model selected in F.1.4 on the test data. Compare it with the RMSE of the model using all first-order predictors (except the one dropped in F.1.2 to address perfect multicollinearity). You will find that the two RMSEs are similar; it seems the best subset model didn't help improve prediction.

  1. Why did the best subset model not help improve prediction accuracy as compared to the model with all the predictors?

  2. Does the best subset model provide a more accurate inference as compared to the model with all the predictors? Why or why not?

(3 + 3 = 6 points)

F.1.7 Two-factor Interactions

Let us consider adding all the 2-factor interactions of all the predictors in the model (except the one dropped in F.1.2 to address perfect multicollinearity). Answer the following questions without running code.

  1. How many predictors do we have in total?

  2. Assume best subset selection is used. How many models are fitted in total?

  3. Assume forward selection is used. How many models are fitted in total?

  4. Assume backward selection is used. How many models are fitted in total?

  5. How many models will be developed in the iteration that contains exactly 10 predictors in each model, for best subset selection and for forward/backward stepwise selection?

  6. What approach would you choose for variable selection (amongst best subset, forward selection, and backward selection)?

(6x2 = 12 points)

F.1.8 Interaction model

Use best subset selection to find the best first-order predictors and 2-factor interactions of the predictors of Y1 (Heating Load). In case different criteria indicate different models as best, choose the most parsimonious model.

Perform the general linear test to show that the model with interactions should be preferred over the additive model selected earlier (in F.1.4).

Find the RMSE of the best model with interactions on the test data. Is it lower than the RMSE of the additive model (selected earlier in F.1.4)?

In case of numerical issues, check if there is perfect multicollinearity and remove the corresponding interaction(s) to eliminate the issue.

Hint: The R syntax to include 2-factor interactions for all the predictors in the dataset is:

model <- lm(Y~.^2, data = data)

where Y is the response, and data is the dataset that consists of all the predictors and the response.

If you wish to remove a certain interaction from the above model, say the interaction between x1 and x2, you can use the following syntax:

model <- lm(Y~.^2-x1:x2, data = data)
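The same formula expansion works in regsubsets; a sketch, assuming leaps is loaded and that dropping one predictor in F.1.2 leaves 7 first-order terms plus 21 two-factor interactions:

# Best subset search over first-order terms and all 2-factor interactions
# (nvmax = 28 = 7 first-order terms + 21 interactions; adjust as needed)
subsets_int <- regsubsets(Y1 ~ .^2, data = ENB_train, nvmax = 28)
summary(subsets_int)$bic   # e.g., BIC of the best model of each size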

(2 + 2 + 2 + 2 = 8 points)

F.1.9 Polynomial terms

In the model developed in the previous question (F.1.8), consider including the polynomial transformation of degree 2 for each predictor, and then repeat the same procedure:

Use best subset selection to find the best first-order and second-order predictors, including 2-factor interactions of the predictors, to predict Y1 (Heating Load). In case different criteria indicate different models as best, choose the most parsimonious model.

Perform the general linear test to show that the model with interactions & polynomial terms of degree 2 should be preferred over the model with interactions but no polynomial terms developed in the previous question (F.1.8).

Find the RMSE of the best model with 2-factor interactions and polynomial terms of degree 2, on the test data. Is it lower than the RMSE of the model with 2-factor interactions but no polynomial terms developed in the previous question (F.1.8)?

In case of numerical issues, check if there is perfect multicollinearity and remove the corresponding polynomial term(s) to eliminate the issue.

Hint: The R syntax to include the polynomial transformation of degree 2 of, say, the predictor x1 in an additive model is:

model <- lm(Y~.+I(x1^2), data = data)
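To add a degree-2 term for every predictor on top of the interaction model, the terms can be spelled out explicitly; a sketch, assuming the predictors remaining after F.1.2 are X1 and X3 through X8 (adjust to your own choice):

# 2-factor interactions plus a quadratic term for each remaining predictor
model <- lm(Y1 ~ .^2 + I(X1^2) + I(X3^2) + I(X4^2) + I(X5^2) +
              I(X6^2) + I(X7^2) + I(X8^2), data = ENB_train)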

(2 + 2 + 2 + 2 = 8 points)

F.1.10 Forward stepwise selection

Repeat the above question, but use forward stepwise selection instead of best subset selection. Compare the time taken in forward stepwise with that taken in best subset selection in the previous question (F.1.9). Which approach is faster, and by what factor?

Also, compare the RMSE on test data of the model chosen with the best subset selection approach (F.1.9) with the model chosen with the forward stepwise selection procedure. Which approach seems to provide a more accurate model?
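A sketch of the timing comparison, assuming form is the formula with all candidate terms from F.1.9 and nvmax covers all of them; method = "backward" gives the variant needed in the next question:

# Time the exhaustive search and the forward stepwise search
# (nvmax = 35 = 7 first-order + 21 interaction + 7 quadratic terms; adjust as needed)
t_best <- system.time(
  regsubsets(form, data = ENB_train, nvmax = 35, method = "exhaustive")
)
t_fwd <- system.time(
  regsubsets(form, data = ENB_train, nvmax = 35, method = "forward")
)
t_best["elapsed"] / t_fwd["elapsed"]   # factor by which forward is faster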

(2 + 2 = 4 points)

F.1.11 Backward stepwise selection

Repeat the above question, but use backward stepwise selection instead of best subset selection. Compare the time taken in backward stepwise with that taken in best subset selection in the earlier question (F.1.9). Which approach is faster, and by what factor?

Also, compare the RMSE on test data of the model chosen with the best subset selection approach (F.1.9) with the model chosen with the backward stepwise selection procedure. Which approach seems to provide a more accurate model?

(2 + 2 = 4 points)

F.2 Forward vs backward vs best subset approach

Mention one advantage of each of the 3 approaches (forward stepwise selection, backward stepwise selection, best subset selection) over the other 2 approaches.

(2x3 = 6 points)

F.3 K-fold cross validation approach

We will use a new approach to develop a model for predicting Y1. The approach considers terms according to the effect hierarchy principle, and assesses whether a term should be added to the model based on the \(K\)-fold cross-validation error.

In this question, you will use \(K\)-fold cross validation to find out the relevant interactions of these predictors, and the relevant interactions of the polynomial transformations of these predictors, for predicting Y1. We'll consider quadratic and cubic transformations of each predictor, and the interactions of these transformed predictors. For example, some of the interaction terms that you will consider are \(X_1^2\), \(X_1 X_3\), \(X_1^2 X_3\), \(X_1 X_3 X_4\), \(X_1 X_3^2 X_4\), \(X_1 X_3^2 X_4 X_5^3\), etc. The highest-degree interaction term will be of degree 21, namely \(X_1^3 X_3^3 X_4^3 X_5^3 X_6^3 X_7^3 X_8^3\), and the lowest-degree interaction terms will be of degree two, such as \(X_1^2\) and \(X_1 X_3\).

The algorithm to find out the relevant interactions using \(K\)-fold cross validation is as follows. Most of the algorithm is already coded for you; you need to code only step (2), as indicated below.

  1. Start by considering interactions of degree d = 2.

  2. Find out the 10-fold cross validation RMSE if an interaction of degree d is added to the model (You need to code only this part).

  3. Repeat step (2) for all possible interactions of degree d.

  4. Include in the model the interaction of degree d that leads to the highest reduction in the 10-fold cross validation error as compared to the previous model (forward stepwise selection based on K-fold cross validation).

  5. Repeat steps 2-4 until no more reduction is possible in the 10-fold cross validation RMSE.

  6. If d = 21, or if no term of a particular degree could reduce the \(K\)-fold cross-validation error further, then stop, otherwise increment d by one, and repeat steps 2-5.

The above algorithm is coded below. The algorithm calls a function KFoldCV to compute the 10-fold cross validation RMSE, given the interaction terms already selected in the model as selected_interactions, and the interaction term to be tested as interaction_being_tested. The function must return the 10-fold cross validation RMSE if interaction_being_tested is added to the model consisting of X1, X3, X4, X5, X6, X7, and X8 in addition to the already added interactions in selected_interactions. The model equation for which you need to find the \(10\)-fold cross validation RMSE will be 'Y1 ~ X1 + X3 + X4 + X5 + X6 + X7 + X8' + selected_interactions + interaction_being_tested.

You need to do the following:

  1. Fill out the function KFoldCV.

  2. Execute the code to obtain the relevant interactions in selected_interactions. Print out the object selected_interactions.

  3. Fit the model with the seven predictors and the selected_interactions, and compute the RMSE on test data.

(15 + 2 + 3 = 20 points)

# Creating a dataframe that will consist of all combinations of polynomial transformations of the 
# predictors to be considered for interactions
library(AlgDesign)
degree <- gen.factorial(4, 7,center = FALSE,
  varNames=c("X1","X3", "X4","X5","X6", "X7","X8"))
degree <- degree-1
degree$sum_degree <- apply(degree,1,sum)
degree$count_zeros <- apply(degree,1,FUN = function(x) sum(x[1:7]==0))
degree <- degree[order(degree$sum_degree,-degree$count_zeros ),]
degree <- degree[-1,]
degree <- degree[,-9]
head(degree)
k=10
fold_size = round(nrow(ENB_train)/k)

# Fill out this function - that is all you need to do to make the code work!

# The function must return the mean 10-fold cross validation RMSE for the model
# that has the 7 individual predictors - 'X1', 'X3', 'X4', 'X5', 'X6', 'X7', and 'X8',
# the 'selected_interactions', and the 'interaction_being_tested'

KFoldCV <- function(selected_interactions, interaction_being_tested)
{
  # Outline: for each of the k folds, hold out that fold as the validation set,
  # fit the model on the remaining folds, predict the held-out fold, and
  # accumulate the prediction errors; return the 10-fold cross validation RMSE.
  #
  # for(i in 1:k)
  # {
  #   # split ENB_train into the i-th fold and the remaining rows (train_set)
  #   formulation <- paste0("Y1~.", selected_interactions, interaction_being_tested)
  #   model <- lm(as.formula(formulation), data = train_set)
  #   # predict on the held-out fold and store the squared errors
  # }
  # return()
}
# This code implements the algorithm of systematically considering interactions
# of degree 2 and going up to the interaction of degree 21. For a given degree
# 'd', the interactions are selected greedily based on the highest reduction in
# the 10-fold cross validation RMSE. Once no more reduction in the 10-fold
# cross validation RMSE is possible using interactions of degree 'd',
# interaction terms of the next higher degree 'd+1' are considered. If there is
# no interaction of degree 'd' that reduces the 10-fold cross validation RMSE,
# the procedure is stopped.

# 10-fold cross validation RMSE of the initial model with the 7 predictors of degree one
cv_previous_model = KFoldCV(selected_interactions = '', interaction_being_tested = '')
pred_names<-colnames(degree)
interaction_being_tested = '+'
selected_interactions = ''

# Considering interactions of degree 'd' = 2 to 21
for(d in 2:21)
{
  d_terms = 0
  
  # Initializing objects to store the interactions of degree 'd' that reduce the 10-fold cross validation RMSEs as compared to the previous model
  interactions_that_reduce_KfoldCV<-c()
  cv_degree <- c()
  
  # Selecting interaction terms of degree = 'd'
  degree_set = degree[degree$sum_degree==d,]
  
  # Continue adding interactions of degree 'd' in the model until no interaction reduces the 10-fold cross-validation RMSE
  while(TRUE)  
  {
    
    #Iterating over all possible interactions of degree 'd'
    for(j in 1:nrow(degree_set))
    {
      
      # Creating the formula expression for the interaction term to be tested
      row = degree_set[j,1:7]
      for(m in 1:7)
      {
        if(row[m]>1)
        {
          interaction_being_tested <- paste0(interaction_being_tested, "I(", pred_names[m],"^",row[m],")*")
        }else if(row[m]==1)
        {
          interaction_being_tested <- paste0(interaction_being_tested, pred_names[m],"*")
        }
      }
      interaction_being_tested<-substr(interaction_being_tested,1,nchar(interaction_being_tested)-1)
      
      # Call the function 'KFoldCV' to find out the 10-fold cross validation error on adding the interaction term being tested to the model
      cv <- KFoldCV(selected_interactions, interaction_being_tested)
      
      # If the interaction term being tested reduces the 10-fold cross validation RMSE as compared to the previous model, then consider adding it to the model
      if(cv<cv_previous_model)
      {
        interactions_that_reduce_KfoldCV<-c(interactions_that_reduce_KfoldCV, interaction_being_tested) 
        cv_degree<-c(cv_degree, cv)
      }
      interaction_being_tested = '+'
    }
    cv_data = data.frame(interaction = interactions_that_reduce_KfoldCV, cv = cv_degree)
    
    # Break the loop if no interaction of degree 'd' reduces the 10-fold cross validation RMSE as compared to the previous model
    if(nrow(cv_data)==0)
    {
      break
    }else{d_terms<-1}
    
    # Sort the interaction terms that reduce the 10-fold cross validation RMSE based on their respective 10-fold cross validation RMSE
    cv_data <- cv_data[order(cv_data$cv),]
    
    # Select the interaction that corresponds to the least 10-fold cross validation RMSE
    selected_interactions = paste0(selected_interactions, cv_data[1,1])
    cv_previous_model = cv_data[1,2]
    interactions_that_reduce_KfoldCV<-c()
    cv_degree <- c()
    
    
    # Print the progress after each model update, i.e., after an interaction term is selected
    print(paste0("Degree of interactions being considered: ", d, ", 10-fold CV RMSE: ", cv_previous_model))
  }
  if(d_terms==0){break}
}